In this project we use a variety of classification models to attempt to predict the location of mass shooting events. We do this by analyzing each shooter's mental health history and signs of being in a crisis in the six months prior to the shooting.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import openpyxl
import sys
import seaborn as sns
import plotly.express as px # graphing interactive map from data
sys.setrecursionlimit(10000000)
# Render our plots inline
%matplotlib inline
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 7)
mass_shootings = pd.read_excel('Violence-Project-Mass-Shooter-Database-Version-5-May-2022.xlsx', sheet_name='Full Database', header=1)
mass_shootings
| Case # | Shooter Last Name | Shooter First Name | Full Date | Day of Week | Day | Month | Year | Shooting Location Address | City | ... | Performance | Interest in Firearms | Firearm Proficiency | Total Firearms Brought to the Scene | Other Weapons or Gear | Specify Other Weapons or Gear | On-Scene Outcome | Attempt to Flee | Insanity Defense | Criminal Sentence | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Whitman | Charles | 1966-08-01 | Monday | 1 | 8 | 1966 | 110 Inner Campus Drive, Austin, TX 78705 | Austin | ... | 0.0 | 1.0 | 3.0 | 7.0 | 1.0 | hatchet, hammer, knives, wrench, ropes, water,... | 1.0 | 0.0 | 2.0 | 0.0 |
| 1 | 2 | Smith | Robert | 1966-11-12 | Saturday | 12 | 11 | 1966 | Rose-Mar College of Beauty in Mesa, AZ | Mesa | ... | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | knife, nylon cord | 2.0 | 0.0 | 1.0 | 1.0 |
| 2 | 3 | Held | Leo | 1967-10-23 | Monday | 23 | 10 | 1967 | 599 South Highland Street Lockhaven, PA 17745 | Lock Haven | ... | 0.0 | 1.0 | 3.0 | 2.0 | 1.0 | holster | 1.0 | 0.0 | 2.0 | 0.0 |
| 3 | 4 | Pearson | Eric | 1968-03-16 | Saturday | 16 | 3 | 1968 | 11703 Lake Rd, Ironwood, MI 49938 | Ironwood | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | 2.0 | 0.0 | 0.0 | 3.0 |
| 4 | 5 | Lambright | Donald | 1969-04-05 | Saturday | 5 | 4 | 1969 | Pennsylvania Turnpike near Harrisburg, PA | Harrisburg | ... | 0.0 | 0.0 | 3.0 | 2.0 | 0.0 | NaN | 0.0 | 0.0 | 2.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 176 | 178 | Gaxiola Gonzalez | Aminadab | 2021-03-31 | Wednesday | 31 | 3 | 2021 | 202 West Lincoln Avenue Orange, CA 92865 | Orange | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | pepper spray, handcuffs, ammunition, locked ex... | 2.0 | 0.0 | 3.0 | NaN |
| 177 | 179 | Hole | Brandon Scott | 2021-04-15 | Thursday | 15 | 4 | 2021 | 8951 Mirabel Rd, Indianapolis, IN 46241 | Indianapolis | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | NaN | 0.0 | 0.0 | 2.0 | NaN |
| 178 | 180 | Cassidy | Samuel | 2021-05-26 | Wednesday | 26 | 5 | 2021 | 101 W Younger Ave, San Jose, CA 95110 | San Jose | ... | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 32 extended magazines | 0.0 | 0.0 | 2.0 | NaN |
| 179 | 181 | Crumbley | Ethan | 2021-11-30 | Tuesday | 30 | 11 | 2021 | 745 N Oxford Rd, Oxford, MI 48371 | Oxford | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 2.0 | 0.0 | 3.0 | NaN |
| 180 | 182 | Gendron | Payton | 2022-05-14 | Saturday | 14 | 5 | 2022 | 1275 Jefferson Ave, Buffalo, NY 14208 | Buffalo | ... | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | tactical gear, bulletproof vest, helmet | 2.0 | 0.0 | NaN | NaN |
181 rows × 142 columns
my_data = mass_shootings[['Age', 'Gender', 'Race', 'Education','Location', 'City', 'State', 'Region', 'Suicidality',
'Voluntary or Involuntary Hospitalization','Prior Hospitalization', 'Prior Counseling',
'Voluntary or Mandatory Counseling', 'Recent or Ongoing Stressor', 'Signs of Being in Crisis',
'Timeline of Signs of Crisis', 'Leakage ', 'Leakage How', 'Leakage Who ', 'Number Killed', 'Number Injured']]
my_data
| Age | Gender | Race | Education | Location | City | State | Region | Suicidality | Voluntary or Involuntary Hospitalization | ... | Prior Counseling | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 0.0 | 0.0 | 2.0 | 1 | Austin | TX | 0 | 2.0 | 0.0 | ... | 1.0 | 1 | 4 | 1.0 | 2.0 | 1.0 | 0 | 0 | 15 | 31 |
| 1 | 18.0 | 0.0 | 0.0 | 0.0 | 4 | Mesa | AZ | 3 | 1.0 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 3.0 | 0.0 | NaN | NaN | 5 | 2 |
| 2 | 39.0 | 0.0 | 0.0 | 2.0 | 9 | Lock Haven | PA | 2 | 2.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 2.0 | 0.0 | NaN | NaN | 6 | 6 |
| 3 | 56.0 | 0.0 | 0.0 | NaN | 5 | Ironwood | MI | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 1 | 0.0 | NaN | 0.0 | NaN | NaN | 7 | 2 |
| 4 | 31.0 | 0.0 | 1.0 | 2.0 | 8 | Harrisburg | PA | 2 | 1.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 0.0 | 0.0 | NaN | NaN | 4 | 17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 176 | 44.0 | 0.0 | 2.0 | NaN | 6 | Orange | CA | 3 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0.0 | NaN | 0.0 | NaN | NaN | 4 | 1 |
| 177 | 19.0 | 0.0 | 0.0 | 0.0 | 9 | Indianapolis | IN | 1 | 1.0 | 0.0 | ... | 1.0 | 1 | 2, 4 | 1.0 | 3.0 | 0.0 | NaN | NaN | 8 | 7 |
| 178 | 57.0 | 0.0 | 0.0 | 2.0 | 9 | San Jose | CA | 3 | 1.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 3.0 | 1.0 | 0, 2 | 2, 9 | 9 | 0 |
| 179 | 15.0 | 0.0 | 0.0 | 0.0 | 0 | Oxford | MI | 1 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 2.0 | 1.0 | 5, 3, 4 | 7, 7, 9 | 4 | 7 |
| 180 | 18.0 | 0.0 | 0.0 | 2.0 | 4 | Buffalo | NY | 2 | 1.0 | 2.0 | ... | 1.0 | 2 | NaN | NaN | NaN | 1.0 | 2022-04-04 00:00:00 | 9, 6, 7 | 10 | 3 |
181 rows × 21 columns
Use fillna() to replace nulls with -1 so that all rows are retained. As a workaround, models that rely on variables with many nulls can exclude rows carrying the -1 sentinel, so the placeholder values do not skew those models while the rest of the data stays available.
my_data = my_data.fillna(value='-1')  # fill with the string '-1'; mixed-type columns stay object dtype and are cast to int later
my_data
| Age | Gender | Race | Education | Location | City | State | Region | Suicidality | Voluntary or Involuntary Hospitalization | ... | Prior Counseling | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 0.0 | 0.0 | 2.0 | 1 | Austin | TX | 0 | 2.0 | 0.0 | ... | 1.0 | 1 | 4 | 1.0 | 2.0 | 1.0 | 0 | 0 | 15 | 31 |
| 1 | 18.0 | 0.0 | 0.0 | 0.0 | 4 | Mesa | AZ | 3 | 1.0 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 5 | 2 |
| 2 | 39.0 | 0.0 | 0.0 | 2.0 | 9 | Lock Haven | PA | 2 | 2.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 2.0 | 0.0 | -1 | -1 | 6 | 6 |
| 3 | 56.0 | 0.0 | 0.0 | -1 | 5 | Ironwood | MI | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 1 | 0.0 | -1 | 0.0 | -1 | -1 | 7 | 2 |
| 4 | 31.0 | 0.0 | 1.0 | 2.0 | 8 | Harrisburg | PA | 2 | 1.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 0.0 | 0.0 | -1 | -1 | 4 | 17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 176 | 44.0 | 0.0 | 2.0 | -1 | 6 | Orange | CA | 3 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0.0 | -1 | 0.0 | -1 | -1 | 4 | 1 |
| 177 | 19.0 | 0.0 | 0.0 | 0.0 | 9 | Indianapolis | IN | 1 | 1.0 | 0.0 | ... | 1.0 | 1 | 2, 4 | 1.0 | 3.0 | 0.0 | -1 | -1 | 8 | 7 |
| 178 | 57.0 | 0.0 | 0.0 | 2.0 | 9 | San Jose | CA | 3 | 1.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 3.0 | 1.0 | 0, 2 | 2, 9 | 9 | 0 |
| 179 | 15.0 | 0.0 | 0.0 | 0.0 | 0 | Oxford | MI | 1 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 2.0 | 1.0 | 5, 3, 4 | 7, 7, 9 | 4 | 7 |
| 180 | 18.0 | 0.0 | 0.0 | 2.0 | 4 | Buffalo | NY | 2 | 1.0 | 2.0 | ... | 1.0 | 2 | -1 | -1 | -1 | 1.0 | 2022-04-04 00:00:00 | 9, 6, 7 | 10 | 3 |
181 rows × 21 columns
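The exclusion workaround described above can be sketched as follows. This is a minimal illustration on a toy frame, not the project's actual model code; `model_cols` is a hypothetical predictor list for one particular model:

```python
import pandas as pd

# Toy frame standing in for my_data after the fillna step:
# -1 marks values that were originally missing.
df = pd.DataFrame({
    'Education': [2, -1, 0, -1],
    'Suicidality': [0, 3, 2, 1],
    'Location': [1, 4, 9, 5],
})

# Hypothetical predictor list for one particular model.
model_cols = ['Education', 'Suicidality']

# Keep only rows where none of this model's predictors is the -1 sentinel.
model_data = df[(df[model_cols] != -1).all(axis=1)]
```

Each model can use its own `model_cols`, so rows are dropped only when a column that model actually uses is missing.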
my_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Age                                       181 non-null    object
 1   Gender                                    181 non-null    object
 2   Race                                      181 non-null    object
 3   Education                                 181 non-null    object
 4   Location                                  181 non-null    int64
 5   City                                      181 non-null    object
 6   State                                     181 non-null    object
 7   Region                                    181 non-null    int64
 8   Suicidality                               181 non-null    object
 9   Voluntary or Involuntary Hospitalization  181 non-null    object
 10  Prior Hospitalization                     181 non-null    object
 11  Prior Counseling                          181 non-null    object
 12  Voluntary or Mandatory Counseling         181 non-null    object
 13  Recent or Ongoing Stressor                181 non-null    object
 14  Signs of Being in Crisis                  181 non-null    object
 15  Timeline of Signs of Crisis               181 non-null    object
 16  Leakage                                   181 non-null    object
 17  Leakage How                               181 non-null    object
 18  Leakage Who                               181 non-null    object
 19  Number Killed                             181 non-null    int64
 20  Number Injured                            181 non-null    int64
dtypes: int64(4), object(17)
memory usage: 29.8+ KB
We can see that there are no longer any null or N/A values. However, some cells contain multiple comma-separated responses for a single variable; we need to re-code those cells to a single value indicating that multiple responses were given.
# re-code comma-separated multi-response cells to a single "multiple responses" value
my_data.loc[my_data['Recent or Ongoing Stressor'].str.contains(', ', na=False), 'Recent or Ongoing Stressor'] = 7
my_data.loc[my_data['Voluntary or Mandatory Counseling'].str.contains(', ', na=False), 'Voluntary or Mandatory Counseling'] = 3
my_data.loc[my_data['Leakage How'].str.contains(', ', na=False), 'Leakage How'] = 6
my_data.loc[my_data['Leakage Who '].str.contains(', ', na=False), 'Leakage Who '] = 10
my_data
| Age | Gender | Race | Education | Location | City | State | Region | Suicidality | Voluntary or Involuntary Hospitalization | ... | Prior Counseling | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 0.0 | 0.0 | 2.0 | 1 | Austin | TX | 0 | 2.0 | 0.0 | ... | 1.0 | 1 | 4 | 1.0 | 2.0 | 1.0 | 0 | 0 | 15 | 31 |
| 1 | 18.0 | 0.0 | 0.0 | 0.0 | 4 | Mesa | AZ | 3 | 1.0 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 5 | 2 |
| 2 | 39.0 | 0.0 | 0.0 | 2.0 | 9 | Lock Haven | PA | 2 | 2.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 2.0 | 0.0 | -1 | -1 | 6 | 6 |
| 3 | 56.0 | 0.0 | 0.0 | -1 | 5 | Ironwood | MI | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 1 | 0.0 | -1 | 0.0 | -1 | -1 | 7 | 2 |
| 4 | 31.0 | 0.0 | 1.0 | 2.0 | 8 | Harrisburg | PA | 2 | 1.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 0.0 | 0.0 | -1 | -1 | 4 | 17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 176 | 44.0 | 0.0 | 2.0 | -1 | 6 | Orange | CA | 3 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0.0 | -1 | 0.0 | -1 | -1 | 4 | 1 |
| 177 | 19.0 | 0.0 | 0.0 | 0.0 | 9 | Indianapolis | IN | 1 | 1.0 | 0.0 | ... | 1.0 | 1 | 7 | 1.0 | 3.0 | 0.0 | -1 | -1 | 8 | 7 |
| 178 | 57.0 | 0.0 | 0.0 | 2.0 | 9 | San Jose | CA | 3 | 1.0 | 0.0 | ... | 0.0 | 0 | 2 | 1.0 | 3.0 | 1.0 | 6 | 10 | 9 | 0 |
| 179 | 15.0 | 0.0 | 0.0 | 0.0 | 0 | Oxford | MI | 1 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 2.0 | 1.0 | 6 | 10 | 4 | 7 |
| 180 | 18.0 | 0.0 | 0.0 | 2.0 | 4 | Buffalo | NY | 2 | 1.0 | 2.0 | ... | 1.0 | 2 | -1 | -1 | -1 | 1.0 | 2022-04-04 00:00:00 | 10 | 10 | 3 |
181 rows × 21 columns
Let's also re-code the Location column, moving K-12 School from code 0 to code 11 so that all location codes fall in the 1 to 11 range.
my_data['Location'] = my_data['Location'].replace(0, 11)  # K-12 school: 0 -> 11
Row 144 has a lot of missing data, and row 152 (the Las Vegas shooting) is an extreme outlier. Let's drop those rows.
# drop the mostly-missing row (144) and the Las Vegas outlier (152)
my_data.drop([144], axis=0, inplace=True)
my_data.drop([152], axis=0, inplace=True)
# also drop row 180, whose cells contain mismatched data types (a datetime in 'Leakage How')
my_data.drop([180], axis=0, inplace=True)
my_data.reset_index()
| index | Age | Gender | Race | Education | Location | City | State | Region | Suicidality | ... | Prior Counseling | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25.0 | 0.0 | 0.0 | 2.0 | 1 | Austin | TX | 0 | 2.0 | ... | 1.0 | 1 | 4 | 1.0 | 2.0 | 1.0 | 0 | 0 | 15 | 31 |
| 1 | 1 | 18.0 | 0.0 | 0.0 | 0.0 | 4 | Mesa | AZ | 3 | 1.0 | ... | 0.0 | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 5 | 2 |
| 2 | 2 | 39.0 | 0.0 | 0.0 | 2.0 | 9 | Lock Haven | PA | 2 | 2.0 | ... | 0.0 | 0 | 2 | 1.0 | 2.0 | 0.0 | -1 | -1 | 6 | 6 |
| 3 | 3 | 56.0 | 0.0 | 0.0 | -1 | 5 | Ironwood | MI | 0 | 0.0 | ... | 0.0 | 0 | 1 | 0.0 | -1 | 0.0 | -1 | -1 | 7 | 2 |
| 4 | 4 | 31.0 | 0.0 | 1.0 | 2.0 | 8 | Harrisburg | PA | 2 | 1.0 | ... | 0.0 | 0 | 2 | 1.0 | 0.0 | 0.0 | -1 | -1 | 4 | 17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 175 | 21.0 | 0.0 | 4.0 | 1.0 | 4 | Boulder | CO | 3 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 10 | 1 |
| 174 | 176 | 44.0 | 0.0 | 2.0 | -1 | 6 | Orange | CA | 3 | 0.0 | ... | 0.0 | 0 | 0 | 0.0 | -1 | 0.0 | -1 | -1 | 4 | 1 |
| 175 | 177 | 19.0 | 0.0 | 0.0 | 0.0 | 9 | Indianapolis | IN | 1 | 1.0 | ... | 1.0 | 1 | 7 | 1.0 | 3.0 | 0.0 | -1 | -1 | 8 | 7 |
| 176 | 178 | 57.0 | 0.0 | 0.0 | 2.0 | 9 | San Jose | CA | 3 | 1.0 | ... | 0.0 | 0 | 2 | 1.0 | 3.0 | 1.0 | 6 | 10 | 9 | 0 |
| 177 | 179 | 15.0 | 0.0 | 0.0 | 0.0 | 11 | Oxford | MI | 1 | 0.0 | ... | 0.0 | 0 | 0 | 1.0 | 2.0 | 1.0 | 6 | 10 | 4 | 7 |
178 rows × 22 columns
my_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Age                                       178 non-null    object
 1   Gender                                    178 non-null    object
 2   Race                                      178 non-null    object
 3   Education                                 178 non-null    object
 4   Location                                  178 non-null    int64
 5   City                                      178 non-null    object
 6   State                                     178 non-null    object
 7   Region                                    178 non-null    int64
 8   Suicidality                               178 non-null    object
 9   Voluntary or Involuntary Hospitalization  178 non-null    object
 10  Prior Hospitalization                     178 non-null    object
 11  Prior Counseling                          178 non-null    object
 12  Voluntary or Mandatory Counseling         178 non-null    object
 13  Recent or Ongoing Stressor                178 non-null    object
 14  Signs of Being in Crisis                  178 non-null    object
 15  Timeline of Signs of Crisis               178 non-null    object
 16  Leakage                                   178 non-null    object
 17  Leakage How                               178 non-null    object
 18  Leakage Who                               178 non-null    object
 19  Number Killed                             178 non-null    int64
 20  Number Injured                            178 non-null    int64
dtypes: int64(4), object(17)
memory usage: 30.6+ KB
Convert the Age column to an integer type.
my_data['Age'] = my_data['Age'].astype('int64')
my_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Age                                       178 non-null    int64
 1   Gender                                    178 non-null    object
 2   Race                                      178 non-null    object
 3   Education                                 178 non-null    object
 4   Location                                  178 non-null    int64
 5   City                                      178 non-null    object
 6   State                                     178 non-null    object
 7   Region                                    178 non-null    int64
 8   Suicidality                               178 non-null    object
 9   Voluntary or Involuntary Hospitalization  178 non-null    object
 10  Prior Hospitalization                     178 non-null    object
 11  Prior Counseling                          178 non-null    object
 12  Voluntary or Mandatory Counseling         178 non-null    object
 13  Recent or Ongoing Stressor                178 non-null    object
 14  Signs of Being in Crisis                  178 non-null    object
 15  Timeline of Signs of Crisis               178 non-null    object
 16  Leakage                                   178 non-null    object
 17  Leakage How                               178 non-null    object
 18  Leakage Who                               178 non-null    object
 19  Number Killed                             178 non-null    int64
 20  Number Injured                            178 non-null    int64
dtypes: int64(5), object(16)
memory usage: 30.6+ KB
Let's add a column for total casualties (killed plus injured).
# sum deaths and injuries into a new column called casualties
my_data["Casualties"] = my_data["Number Killed"] + my_data["Number Injured"]
my_data.head()
| Age | Gender | Race | Education | Location | City | State | Region | Suicidality | Voluntary or Involuntary Hospitalization | ... | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | Casualties | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 0.0 | 0.0 | 2.0 | 1 | Austin | TX | 0 | 2.0 | 0.0 | ... | 1 | 4 | 1.0 | 2.0 | 1.0 | 0 | 0 | 15 | 31 | 46 |
| 1 | 18 | 0.0 | 0.0 | 0.0 | 4 | Mesa | AZ | 3 | 1.0 | 0.0 | ... | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 5 | 2 | 7 |
| 2 | 39 | 0.0 | 0.0 | 2.0 | 9 | Lock Haven | PA | 2 | 2.0 | 0.0 | ... | 0 | 2 | 1.0 | 2.0 | 0.0 | -1 | -1 | 6 | 6 | 12 |
| 3 | 56 | 0.0 | 0.0 | -1 | 5 | Ironwood | MI | 0 | 0.0 | 0.0 | ... | 0 | 1 | 0.0 | -1 | 0.0 | -1 | -1 | 7 | 2 | 9 |
| 4 | 31 | 0.0 | 1.0 | 2.0 | 8 | Harrisburg | PA | 2 | 1.0 | 0.0 | ... | 0 | 2 | 1.0 | 0.0 | 0.0 | -1 | -1 | 4 | 17 | 21 |
5 rows × 22 columns
# confirm that no null values remain
my_data.isna().sum()
Age                                         0
Gender                                      0
Race                                        0
Education                                   0
Location                                    0
City                                        0
State                                       0
Region                                      0
Suicidality                                 0
Voluntary or Involuntary Hospitalization    0
Prior Hospitalization                       0
Prior Counseling                            0
Voluntary or Mandatory Counseling           0
Recent or Ongoing Stressor                  0
Signs of Being in Crisis                    0
Timeline of Signs of Crisis                 0
Leakage                                     0
Leakage How                                 0
Leakage Who                                 0
Number Killed                               0
Number Injured                              0
Casualties                                  0
dtype: int64
All N/A values have been replaced with -1 and the data is sufficiently clean.
# copy for the correlation heatmap; drop the non-numeric City and State columns
my_data_hm = my_data.drop(['State', 'City'], axis=1)
my_data_hm = my_data_hm.astype('int64')
my_data_hm.describe()
| Age | Gender | Race | Education | Location | Region | Suicidality | Voluntary or Involuntary Hospitalization | Prior Hospitalization | Prior Counseling | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | Casualties | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 |
| mean | 33.741573 | 0.022472 | 0.848315 | 0.904494 | 5.983146 | 1.438202 | 1.117978 | 0.370787 | 0.191011 | 0.292135 | 0.415730 | 3.061798 | 0.825843 | 1.432584 | 0.443820 | 0.101124 | 1.820225 | 6.910112 | 6.477528 | 13.387640 |
| std | 12.180403 | 0.148631 | 1.408017 | 1.524520 | 2.750397 | 1.284018 | 0.818302 | 0.764790 | 0.394207 | 0.456027 | 0.733523 | 2.774297 | 0.422537 | 1.498947 | 0.498235 | 1.889878 | 3.801759 | 5.432982 | 10.041836 | 13.807444 |
| min | 11.000000 | 0.000000 | -1.000000 | -1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | -1.000000 | -1.000000 | 4.000000 | 0.000000 | 4.000000 |
| 25% | 24.000000 | 0.000000 | 0.000000 | -1.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 | 4.000000 | 1.000000 | 6.000000 |
| 50% | 33.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 1.000000 | 2.000000 | 0.000000 | -1.000000 | -1.000000 | 5.000000 | 3.000000 | 8.000000 |
| 75% | 42.750000 | 0.000000 | 1.000000 | 2.000000 | 8.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 7.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 4.000000 | 7.000000 | 7.000000 | 15.000000 |
| max | 70.000000 | 1.000000 | 6.000000 | 4.000000 | 11.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 3.000000 | 7.000000 | 3.000000 | 3.000000 | 1.000000 | 6.000000 | 10.000000 | 49.000000 | 70.000000 | 102.000000 |
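A caveat on the astype('int64') call above: it raises on any stray non-numeric cell (like the datetime that appeared in row 180 before we dropped it). A more defensive sketch uses pd.to_numeric with errors='coerce' so that bad cells surface as NaN instead of an exception:

```python
import pandas as pd

# Toy column with two unparseable cells: a multi-response string and a datetime string.
s = pd.Series(['2', '0, 2', '-1', '2022-04-04 00:00:00'])

# errors='coerce' turns anything unparseable into NaN rather than raising.
numeric = pd.to_numeric(s, errors='coerce')

# Rows that failed conversion can then be inspected and re-coded by hand.
bad_rows = s[numeric.isna()]
```

This makes it easy to find which cells still need manual cleaning before a hard cast to int64.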
# mask the upper triangle so each correlation appears only once in the heatmap
mask = np.zeros_like(my_data_hm.corr())
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize = (24,16))
sns.heatmap(my_data_hm.corr(), mask=mask, annot=True, cmap="RdYlGn", linewidths=.75)
<AxesSubplot:>
my_data.describe()
| Age | Location | Region | Number Killed | Number Injured | Casualties | |
|---|---|---|---|---|---|---|
| count | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 |
| mean | 33.741573 | 5.983146 | 1.438202 | 6.910112 | 6.477528 | 13.387640 |
| std | 12.180403 | 2.750397 | 1.284018 | 5.432982 | 10.041836 | 13.807444 |
| min | 11.000000 | 1.000000 | 0.000000 | 4.000000 | 0.000000 | 4.000000 |
| 25% | 24.000000 | 4.000000 | 0.000000 | 4.000000 | 1.000000 | 6.000000 |
| 50% | 33.000000 | 6.000000 | 1.000000 | 5.000000 | 3.000000 | 8.000000 |
| 75% | 42.750000 | 8.000000 | 3.000000 | 7.000000 | 7.000000 | 15.000000 |
| max | 70.000000 | 11.000000 | 3.000000 | 49.000000 | 70.000000 | 102.000000 |
We can see that the average number of casualties is about 13, and the average shooter age is about 34 with a standard deviation of 12 years.
my_data
| Age | Gender | Race | Education | Location | City | State | Region | Suicidality | Voluntary or Involuntary Hospitalization | ... | Voluntary or Mandatory Counseling | Recent or Ongoing Stressor | Signs of Being in Crisis | Timeline of Signs of Crisis | Leakage | Leakage How | Leakage Who | Number Killed | Number Injured | Casualties | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 0.0 | 0.0 | 2.0 | 1 | Austin | TX | 0 | 2.0 | 0.0 | ... | 1 | 4 | 1.0 | 2.0 | 1.0 | 0 | 0 | 15 | 31 | 46 |
| 1 | 18 | 0.0 | 0.0 | 0.0 | 4 | Mesa | AZ | 3 | 1.0 | 0.0 | ... | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 5 | 2 | 7 |
| 2 | 39 | 0.0 | 0.0 | 2.0 | 9 | Lock Haven | PA | 2 | 2.0 | 0.0 | ... | 0 | 2 | 1.0 | 2.0 | 0.0 | -1 | -1 | 6 | 6 | 12 |
| 3 | 56 | 0.0 | 0.0 | -1 | 5 | Ironwood | MI | 0 | 0.0 | 0.0 | ... | 0 | 1 | 0.0 | -1 | 0.0 | -1 | -1 | 7 | 2 | 9 |
| 4 | 31 | 0.0 | 1.0 | 2.0 | 8 | Harrisburg | PA | 2 | 1.0 | 0.0 | ... | 0 | 2 | 1.0 | 0.0 | 0.0 | -1 | -1 | 4 | 17 | 21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 175 | 21 | 0.0 | 4.0 | 1.0 | 4 | Boulder | CO | 3 | 0.0 | 0.0 | ... | 0 | 0 | 1.0 | 3.0 | 0.0 | -1 | -1 | 10 | 1 | 11 |
| 176 | 44 | 0.0 | 2.0 | -1 | 6 | Orange | CA | 3 | 0.0 | 0.0 | ... | 0 | 0 | 0.0 | -1 | 0.0 | -1 | -1 | 4 | 1 | 5 |
| 177 | 19 | 0.0 | 0.0 | 0.0 | 9 | Indianapolis | IN | 1 | 1.0 | 0.0 | ... | 1 | 7 | 1.0 | 3.0 | 0.0 | -1 | -1 | 8 | 7 | 15 |
| 178 | 57 | 0.0 | 0.0 | 2.0 | 9 | San Jose | CA | 3 | 1.0 | 0.0 | ... | 0 | 2 | 1.0 | 3.0 | 1.0 | 6 | 10 | 9 | 0 | 9 |
| 179 | 15 | 0.0 | 0.0 | 0.0 | 11 | Oxford | MI | 1 | 0.0 | 0.0 | ... | 0 | 0 | 1.0 | 2.0 | 1.0 | 6 | 10 | 4 | 7 | 11 |
178 rows × 22 columns
my_data['Casualties'].value_counts()
6      22
4      20
8      19
7      18
5      15
9      14
11      8
10      7
14      4
20      4
36      4
17      4
16      4
15      4
12      4
21      3
45      3
25      2
46      2
35      1
28      1
49      1
33      1
23      1
34      1
48      1
102     1
82      1
40      1
19      1
13      1
58      1
24      1
29      1
27      1
30      1
Name: Casualties, dtype: int64
This shows that most shootings in the data set have fewer than 10 casualties.
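That claim can be checked directly rather than read off the counts. A sketch on toy numbers (not the real Casualties column) computing the share of events with fewer than 10 casualties:

```python
import pandas as pd

# Toy casualty counts standing in for my_data['Casualties'].
casualties = pd.Series([6, 4, 8, 7, 5, 9, 11, 46, 6, 4])

# Mean of a boolean mask gives the fraction of True values.
share_under_10 = (casualties < 10).mean()
```

On the real column, `(my_data['Casualties'] < 10).mean()` would give the exact share.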
sns.set_theme(style="darkgrid")
sns.countplot(y="Location", data=my_data, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
# tick positions follow the sorted codes 1-11 (K-12 school is now 11)
plt.yticks(range(11), ['College/university','Government building / \nplace of civic importance',
           'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
           'Outdoors','Warehouse/factory', 'Post office', 'K-12 school'])
plt.title('Location of Mass Shootings')
plt.show()
fig = px.bar(my_data, x='Casualties',y='Suicidality', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='stack', xaxis={'categoryorder':'total descending'},
title='Total Casualties by Suicidality',
yaxis = dict(
tickmode = 'array',
tickvals = [0, 1, 2],
ticktext = ['No evidence','Yes Prior', 'Not Prior']
)
)
fig
fig = px.histogram(my_data, x='Location',color='Prior Hospitalization', height=600, width=850)
fig.update_layout(
template="seaborn",barmode='group',
title='Distribution of Prior Hospitalization by Location',
xaxis = dict(
tickmode = 'array',
tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
ticktext = ['College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence','Outdoors',
'Warehouse/factory', 'Post office', 'K-12 school']
)
)
fig
There are not many records of prior hospitalization in the data set. Retail (location 4) has the most, with 6. College/university (location 1) is the only location where the majority of cases have a record of prior hospitalization.
fig = px.histogram(my_data, x='Location',color='Voluntary or Involuntary Hospitalization', height=600, width=850)
fig.update_layout(
template="seaborn",barmode='group',
title='Distribution of Location by Voluntary or Involuntary Hospitalization',
xaxis = dict(
tickmode = 'array',
tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
ticktext = ['College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
)
)
fig
The only locations with a record of prior voluntary hospitalization are Retail, Place of residence, Outdoors, and Warehouse/factory.
fig = px.histogram(my_data, x='Location',color='Suicidality', height=600, width=850)
fig.update_layout(
template="seaborn",barmode='group',
title='Distribution of Suicidality by Location',
xaxis = dict(
tickmode = 'array',
tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
ticktext = ['College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
)
)
fig
K-12 school shooters were the most likely to have been suicidal prior to the shooting. Warehouse/factory and retail shooters were not suicidal prior to the shooting; however, they did intend to die in the attack.
fig = px.histogram(my_data, x='Location',color='Signs of Being in Crisis', height=600, width=850)
fig.update_layout(
template="seaborn",barmode='group',
title='Distribution of Signs of Being in Crisis by Location',
xaxis = dict(
tickmode = 'array',
tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
ticktext = ['College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
)
)
fig
Shooters at retail, restaurant/bar/nightclub, and office locations most often displayed signs of being in a crisis.
fig = px.histogram(my_data, x='Location',color='Timeline of Signs of Crisis', height=600, width=850)
fig.update_layout(
template="seaborn",barmode='group',
title='Distribution of Timeline of Signs of Crisis by Location',
xaxis = dict(
tickmode = 'array',
tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
ticktext = ['College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
)
)
fig
Timeline of Signs of Crisis: N/A = -1, Days before shooting = 0, Weeks before shooting = 1, Months before shooting = 2, Years before shooting = 3.
Shooters at retail locations most often displayed signs of being in a crisis years in advance. Shooters at outdoor locations are the most impulsive, showing signs of being in a crisis only days prior to the event.
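For plots and tables, the numeric timeline codes can be mapped back to the readable labels from the legend above. A small sketch; the dictionary mirrors the coding, with -1 as the N/A sentinel:

```python
import pandas as pd

# Code -> label, following the Timeline of Signs of Crisis legend.
timeline_labels = {
    -1: 'N/A',
    0: 'Days before shooting',
    1: 'Weeks before shooting',
    2: 'Months before shooting',
    3: 'Years before shooting',
}

codes = pd.Series([2, -1, 3, 0])  # toy stand-in for the real column
labels = codes.map(timeline_labels)
```

The same pattern works for the Location and Stressor codes used in the charts.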
fig = px.scatter(my_data, x='Location',y='Age', size='Casualties', height=600, width=900)
fig.update_layout(
template="seaborn",barmode='group',
title="Distribution of Number of Casualties by Location and Age",
xaxis = dict(
tickmode = 'array',
tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
ticktext = ['College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
)
)
fig
Most K-12 school shooters are under the age of 20.
my_data.loc[(my_data['Prior Hospitalization']==1) & (my_data['Voluntary or Involuntary Hospitalization']==1),
'Location'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Voluntary Prior Hospitalization')
<AxesSubplot:title={'center':'Voluntary Prior Hospitalization'}, ylabel='Location'>
There is only one instance of voluntary prior hospitalization in each of four locations: Retail (4), Place of residence (7), Outdoors (8), and Warehouse/factory (9).
my_data.loc[(my_data['Prior Hospitalization']==1) & (my_data['Voluntary or Involuntary Hospitalization']==2),
'Location'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Involuntary Prior Hospitalization')
<AxesSubplot:title={'center':'Involuntary Prior Hospitalization'}, ylabel='Location'>
Retail (4) and College/university (1) have the highest percentage (16.7%) of involuntary prior hospitalization.
# value_counts() orders slices by frequency; sort by code so the labels stay aligned
counts = pd.to_numeric(my_data['Timeline of Signs of Crisis']).value_counts().sort_index()
plt.pie(counts, labels=['N/A', 'Days', 'Weeks', 'Months', 'Years'], autopct='%1.1f%%')
plt.title('Timeline of Signs of Crisis')
Text(0.5, 1.0, 'Timeline of Signs of Crisis')
80% of shooters showed signs of being in a crisis, with almost 25% just days prior to the shooting.
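The 80% figure can be recomputed from the Signs of Being in Crisis column itself. A sketch on toy codes (1 = showed signs, 0 = did not, -1 = unknown), not the real data:

```python
import pandas as pd

# Toy stand-in for my_data['Signs of Being in Crisis'].
signs = pd.Series([1, 1, 0, 1, 1, -1, 1, 1, 0, 1])

# Percentage of shooters coded as having shown signs of crisis.
pct_in_crisis = (signs == 1).mean() * 100
```

On the real column this would be `(my_data['Signs of Being in Crisis'] == 1).mean() * 100`.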
sns.countplot(data=my_data, x="State", hue="Gender")
plt.title('Number of Shootings By State', fontsize=18)
Text(0.5, 1.0, 'Number of Shootings By State')
The top three states with the most shootings are California, Texas, and Florida.
fig = px.scatter(my_data, x='Recent or Ongoing Stressor',y='Casualties', height=600, width=850)
fig.update_layout(
template="seaborn",barmode='group',
title="Distribution of Casualties by Recent or Ongoing Stressor",
xaxis = dict(
tickmode = 'array',
tickvals = [0, 1, 2, 3, 4, 5, 6, 7],
ticktext = ['No evidence','Recent break-up','Employment stressor','Economic stressor',
'Family issue','Legal issue','Other','Multiple Stressors']
)
)
fig
Recent or Ongoing Stressors: No evidence = 0, Recent break-up = 1, Employment stressors = 2, Economic stressors = 3, Family issue = 4, Legal issue = 5, Other = 6, Multiple = 7.
Employment stressors are the most common; economic stressors are the least common.
Let's build a data frame of the total number of casualties per state.
state_shootings = my_data[['State', 'Casualties']]
state_shootings = state_shootings.groupby('State')['Casualties'].sum().reset_index()
state_shootings
| | State | Casualties |
|---|---|---|
| 0 | AK | 21 |
| 1 | AL | 5 |
| 2 | AR | 36 |
| 3 | AZ | 30 |
| 4 | CA | 380 |
| 5 | CO | 186 |
| 6 | CT | 42 |
| 7 | DC | 20 |
| 8 | FL | 273 |
| 9 | GA | 41 |
| 10 | HI | 7 |
| 11 | IA | 6 |
| 12 | ID | 4 |
| 13 | IL | 53 |
| 14 | IN | 24 |
| 15 | KS | 7 |
| 16 | KY | 44 |
| 17 | LA | 29 |
| 18 | MA | 7 |
| 19 | MD | 8 |
| 20 | MI | 42 |
| 21 | MN | 24 |
| 22 | MO | 14 |
| 23 | MS | 26 |
| 24 | NC | 50 |
| 25 | NE | 13 |
| 26 | NH | 8 |
| 27 | NJ | 36 |
| 28 | NV | 16 |
| 29 | NY | 87 |
| 30 | OH | 60 |
| 31 | OK | 20 |
| 32 | OR | 77 |
| 33 | PA | 82 |
| 34 | RI | 4 |
| 35 | SC | 16 |
| 36 | TN | 15 |
| 37 | TX | 381 |
| 38 | UT | 9 |
| 39 | VA | 74 |
| 40 | WA | 69 |
| 41 | WI | 37 |
Now let's plot the total number of casualties per state.
fig = px.choropleth(state_shootings,
locations="State", # DataFrame column with locations
color="Casualties", # DataFrame column with color values
hover_name="State", # DataFrame column hover info
locationmode = 'USA-states') # Set to plot as US States
fig.update_layout(title_text = 'Mass Shooting Casualties by State', geo_scope='usa',)
fig.show()
We can see that the states with the most casualties are Texas (381), followed by California (380), Florida (273), and Colorado (186). Note: this does not take into account the number of shootings or each state's population.
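As a rough illustration of that caveat, the totals can be normalized by the number of shootings per state. A hypothetical toy frame stands in for `my_data[['State', 'Casualties']]` here:

```python
import pandas as pd

# Hypothetical toy data standing in for my_data[['State', 'Casualties']]
toy = pd.DataFrame({
    'State': ['TX', 'TX', 'CA', 'CA', 'CA', 'FL'],
    'Casualties': [30, 10, 12, 8, 4, 20],
})

# Aggregate totals and shooting counts, then casualties per shooting
per_state = toy.groupby('State')['Casualties'].agg(total='sum', shootings='count')
per_state['per_shooting'] = per_state['total'] / per_state['shootings']
print(per_state)
```

In this toy example CA has the highest total but the lowest casualties per shooting, which is exactly the distinction the choropleth above cannot show.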
# group by location
my_data.groupby("Location").size().sort_values()
Location
10     4
1      9
2      9
3     11
8     14
11    14
7     15
6     18
5     25
9     25
4     34
dtype: int64
fig = plt.figure()
# Divide the figure into a 1x2 grid, and give me the first section
ax1 = fig.add_subplot(121)
# Divide the figure into a 1x2 grid, and give me the second section
ax2 = fig.add_subplot(122)
s = my_data.Location.value_counts(normalize=True).mul(100)  # .mul(100) converts proportions to percentages
s.index.name, s.name = 'Location', 'percentage'  # name the index and series so the plot is labelled
s.to_frame().plot(kind='bar', ax=ax1, ylim=[0, 100])  # Series.to_frame() returns a DataFrame
s = my_data.Race.value_counts(normalize=True).mul(100)
s.index.name, s.name = 'Race', 'percentage'
s.to_frame().plot(kind='bar', ax=ax2, ylim=[0, 100], width=0.15)
<AxesSubplot:xlabel='Race'>
Retail stores are the most common locations for mass shootings, and the most common race of shooters is White.
my_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 22 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Age                                       178 non-null    int64
 1   Gender                                    178 non-null    object
 2   Race                                      178 non-null    object
 3   Education                                 178 non-null    object
 4   Location                                  178 non-null    int64
 5   City                                      178 non-null    object
 6   State                                     178 non-null    object
 7   Region                                    178 non-null    int64
 8   Suicidality                               178 non-null    object
 9   Voluntary or Involuntary Hospitalization  178 non-null    object
 10  Prior Hospitalization                     178 non-null    object
 11  Prior Counseling                          178 non-null    object
 12  Voluntary or Mandatory Counseling         178 non-null    object
 13  Recent or Ongoing Stressor                178 non-null    object
 14  Signs of Being in Crisis                  178 non-null    object
 15  Timeline of Signs of Crisis               178 non-null    object
 16  Leakage                                   178 non-null    object
 17  Leakage How                               178 non-null    object
 18  Leakage Who                               178 non-null    object
 19  Number Killed                             178 non-null    int64
 20  Number Injured                            178 non-null    int64
 21  Casualties                                178 non-null    int64
dtypes: int64(6), object(16)
memory usage: 32.0+ KB
model_df = my_data[['Age', 'Gender', 'Race', 'Location', 'Suicidality', 'Voluntary or Involuntary Hospitalization','Prior Hospitalization',
                    'Prior Counseling', 'Voluntary or Mandatory Counseling', 'Recent or Ongoing Stressor',
                    'Signs of Being in Crisis','Timeline of Signs of Crisis', 'Leakage ', 'Leakage How',
                    'Leakage Who ', 'Number Killed', 'Number Injured', 'Casualties']].copy()  # .copy() avoids SettingWithCopyWarning when adding columns below
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']
model_df['Age_binned'] = pd.cut(model_df['Age'], bins=bin_age, labels=category_age)
model_df = model_df.drop(['Age'], axis = 1)
bin_Number_Killed = [0, 4, 9, 50]
category_Number_Killed = ['Low', 'Medium', 'High']
model_df['Number_Killed_binned'] = pd.cut(model_df['Number Killed'], bins=bin_Number_Killed, labels=category_Number_Killed)
model_df = model_df.drop(['Number Killed'], axis = 1)
bin_Number_Injured = [0, 9, 29, 70]
category_Number_Injured = ['Low', 'Medium', 'High']
model_df['Number_Injured_binned'] = pd.cut(model_df['Number Injured'], bins=bin_Number_Injured, labels=category_Number_Injured)
model_df = model_df.drop(['Number Injured'], axis = 1)
bin_Casualties = [0, 14, 44, 110]
category_Casualties = ['Low', 'Medium', 'High']
model_df['Casualties_binned'] = pd.cut(model_df['Casualties'], bins=bin_Casualties, labels=category_Casualties)
model_df = model_df.drop(['Casualties'], axis = 1)
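One caveat worth knowing about `pd.cut`: bins are open on the left by default, so a value equal to the lowest edge (e.g. a shooting with 0 injured against the `[0, 9, 29, 70]` bins) falls outside the first interval and becomes NaN. A small sketch of the behavior:

```python
import pandas as pd

# pd.cut bins are left-open by default: (0, 9], (9, 29], (29, 70]
injured = pd.Series([0, 5, 29, 30])
binned = pd.cut(injured, bins=[0, 9, 29, 70], labels=['Low', 'Medium', 'High'])
print(binned.tolist())  # [nan, 'Low', 'Medium', 'High'] — 0 falls outside (0, 9]

# include_lowest=True closes the first bin on the left, so 0 is kept
binned2 = pd.cut(injured, bins=[0, 9, 29, 70], labels=['Low', 'Medium', 'High'],
                 include_lowest=True)
print(binned2.tolist())  # ['Low', 'Low', 'Medium', 'High']
```

Whether this matters here depends on whether any rows have a value of exactly 0 in the binned columns.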
We need to separate the response variable from the predictor variables.
X = model_df.drop(["Location"], axis=1)
y = model_df["Location"]
X = pd.get_dummies(X)
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/pandas/core/algorithms.py:798: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Because we are working with a small data set, we will hold out 33% of the samples for testing and train on the remaining 67%.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
print('The Shape Of The Original Data: ', model_df.shape)
print('The Shape Of x_test: ', x_test.shape)
print('The Shape Of x_train: ', x_train.shape)
print('The Shape Of y_test: ', y_test.shape)
print('The Shape Of y_train: ', y_train.shape)
The Shape Of The Original Data:  (178, 18)
The Shape Of x_test:  (59, 78)
The Shape Of x_train:  (119, 78)
The Shape Of y_test:  (59,)
The Shape Of y_train:  (119,)
This confirms that the test sample is 33% of the original data set (59 of 178 rows).
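As a quick sanity check on the shapes, scikit-learn rounds a float `test_size` up to a whole number of samples:

```python
import math

# train_test_split rounds the test fraction up: ceil(178 * 0.33) = 59
n_total, test_size = 178, 0.33
n_test = math.ceil(n_total * test_size)
n_train = n_total - n_test
print(n_test, n_train)          # 59 119
print(round(n_test / n_total, 3))  # 0.331 — effectively a 33% test split
```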
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique, but it works differently from classic oversampling.
A classic oversampling technique duplicates records from the minority class. While this increases the amount of data, it gives no new information or variation to the machine learning model.
SMOTE instead uses a k-nearest-neighbors algorithm to create synthetic data: it picks a random sample from the minority class, finds its k nearest neighbors, and generates a synthetic point between the sample and a randomly selected neighbor.
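The interpolation step can be sketched in a few lines of NumPy. This is an illustration of the idea only, not imblearn's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of SMOTE's core step: pick a minority sample, pick one of its
# nearest neighbours, and create a synthetic point on the segment between them.
minority = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5]])

sample = minority[0]
# nearest neighbour of `sample` among the remaining points (Euclidean distance)
others = minority[1:]
neighbour = others[np.argmin(np.linalg.norm(others - sample, axis=1))]

gap = rng.random()                          # uniform in [0, 1)
synthetic = sample + gap * (neighbour - sample)
print(synthetic)                            # lies between sample and neighbour
```

Because the synthetic points lie on segments between real minority samples, they add variation without being exact duplicates.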
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, data=model_df, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Unbalanced Data')
plt.show()
This graph shows us that the training data set is not balanced.
from imblearn.over_sampling import SMOTE
x_train, y_train = SMOTE(k_neighbors=1).fit_resample(x_train, y_train)
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, data=model_df, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Balanced Data')
plt.show()
This shows that the distribution of Location in the training set is now balanced.
from sklearn.linear_model import LogisticRegression
LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)
y_pred = LRclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
from sklearn.metrics import accuracy_score
LRAcc = accuracy_score(y_pred,y_test)
print('Logistic Regression accuracy is: {:.2f}%'.format(LRAcc*100))
precision recall f1-score support
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.24 0.62 0.34 8
5 0.13 0.33 0.19 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.50 0.17 0.25 6
9 0.43 0.43 0.43 7
11 0.43 0.75 0.55 4
accuracy 0.24 59
macro avg 0.17 0.23 0.18 59
weighted avg 0.18 0.24 0.18 59
[[0 0 0 2 1 0 0 0 0 1]
[1 0 0 3 1 0 0 0 0 0]
[1 0 0 1 1 0 2 1 0 1]
[0 0 0 5 3 0 0 0 0 0]
[0 0 1 2 2 0 0 0 0 1]
[0 1 0 3 3 0 0 0 3 0]
[0 0 0 1 0 0 0 0 1 0]
[0 0 0 2 3 0 0 1 0 0]
[0 0 0 2 0 1 0 0 3 1]
[0 0 0 0 1 0 0 0 0 3]]
Logistic Regression accuracy is: 23.73%
from sklearn.neighbors import KNeighborsClassifier
KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(x_train, y_train)
y_pred = KNclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
KNAcc = accuracy_score(y_pred,y_test)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc*100))
precision recall f1-score support
1 1.00 0.25 0.40 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.17 0.50 0.25 8
5 0.12 0.33 0.18 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.25 0.43 0.32 7
11 0.50 0.50 0.50 4
accuracy 0.20 59
macro avg 0.20 0.20 0.16 59
weighted avg 0.17 0.20 0.15 59
[[1 0 0 1 0 0 0 0 1 1]
[0 0 0 3 1 0 0 0 1 0]
[0 0 0 2 1 0 2 0 1 1]
[0 0 0 4 3 0 0 0 1 0]
[0 0 0 4 2 0 0 0 0 0]
[0 0 0 2 5 0 0 0 3 0]
[0 0 0 0 0 0 0 0 2 0]
[0 0 0 3 3 0 0 0 0 0]
[0 0 0 3 1 0 0 0 3 0]
[0 0 0 2 0 0 0 0 0 2]]
K Neighbours accuracy is: 20.34%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
from sklearn.svm import SVC
SVCclassifier = SVC(kernel='linear', max_iter=251)
SVCclassifier.fit(x_train, y_train)
y_pred = SVCclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
SVCAcc = accuracy_score(y_pred,y_test)
print('SVC accuracy is: {:.2f}%'.format(SVCAcc*100))
precision recall f1-score support
1 0.50 0.75 0.60 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.22 0.50 0.31 8
5 0.09 0.17 0.12 6
6 0.20 0.10 0.13 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.50 0.43 0.46 7
10 0.00 0.00 0.00 0
11 0.43 0.75 0.55 4
accuracy 0.25 59
macro avg 0.18 0.25 0.20 59
weighted avg 0.20 0.25 0.21 59
[[3 0 0 0 0 0 0 0 0 0 1]
[2 0 0 2 1 0 0 0 0 0 0]
[1 0 0 1 0 1 2 0 1 0 1]
[0 0 1 4 2 1 0 0 0 0 0]
[0 0 1 2 1 2 0 0 0 0 0]
[0 1 0 3 2 1 0 0 2 0 1]
[0 0 0 2 0 0 0 0 0 0 0]
[0 0 0 2 4 0 0 0 0 0 0]
[0 0 0 2 0 0 0 0 3 1 1]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 3]]
SVC accuracy is: 25.42%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py:301: ConvergenceWarning: Solver terminated early (max_iter=251). Consider pre-processing your data with StandardScaler or MinMaxScaler.
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
from sklearn.naive_bayes import GaussianNB
NBclassifier2 = GaussianNB()
NBclassifier2.fit(x_train, y_train)
y_pred = NBclassifier2.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
NBAcc2 = accuracy_score(y_pred,y_test)
print('Gaussian Naive Bayes accuracy is: {:.2f}%'.format(NBAcc2*100))
precision recall f1-score support
1 0.33 0.25 0.29 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.10 0.25 0.14 8
5 0.10 0.33 0.15 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.43 0.43 0.43 7
10 0.00 0.00 0.00 0
11 0.00 0.00 0.00 4
accuracy 0.14 59
macro avg 0.09 0.11 0.09 59
weighted avg 0.10 0.14 0.10 59
[[1 0 0 1 1 0 0 0 0 0 1]
[0 0 0 4 1 0 0 0 0 0 0]
[1 0 0 1 5 0 0 0 0 0 0]
[0 1 0 2 3 0 0 0 1 1 0]
[1 0 0 2 2 0 0 0 1 0 0]
[0 1 0 3 3 0 0 0 2 1 0]
[0 0 0 2 0 0 0 0 0 0 0]
[0 0 0 2 4 0 0 0 0 0 0]
[0 0 0 1 0 0 2 1 3 0 0]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 3 1 0 0 0 0 0 0]]
Gaussian Naive Bayes accuracy is: 13.56%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
DTclassifier = DecisionTreeClassifier(max_leaf_nodes=20)
DTclassifier = DTclassifier.fit(x_train, y_train)
y_pred = DTclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
DTAcc = accuracy_score(y_pred,y_test)
print('Decision Tree accuracy is: {:.2f}%'.format(DTAcc*100))
precision recall f1-score support
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.15 0.62 0.24 8
5 0.08 0.17 0.11 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.50 0.29 0.36 7
10 0.00 0.00 0.00 0
11 0.60 0.75 0.67 4
accuracy 0.19 59
macro avg 0.12 0.17 0.13 59
weighted avg 0.13 0.19 0.13 59
[[0 0 0 4 0 0 0 0 0 0 0]
[0 0 0 2 2 0 0 0 0 1 0]
[0 0 0 5 1 0 0 0 1 0 0]
[0 0 0 5 2 0 0 0 0 0 1]
[0 0 0 4 1 0 0 0 0 1 0]
[0 0 0 5 3 0 0 0 1 1 0]
[0 0 0 2 0 0 0 0 0 0 0]
[0 0 0 3 3 0 0 0 0 0 0]
[0 0 0 2 1 0 0 0 2 1 1]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 3]]
Decision Tree accuracy is: 18.64%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (20,20), dpi=600)
tree.plot_tree(DTclassifier, max_depth = 20, feature_names = X.columns, filled=True)
plt.show()
Decision trees place the nodes with the least entropy (the most information gain) at the top of the tree, meaning those features yield significantly more information than the others.
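The idea can be made concrete with a small entropy helper. Note that scikit-learn's `DecisionTreeClassifier` actually defaults to Gini impurity, a closely related measure, but the intuition is the same:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# A pure node carries no uncertainty; an even split carries the most.
print(entropy([10, 0]))  # 0.0
print(entropy([5, 5]))   # 1.0
print(entropy([8, 2]))   # ~0.72 — lower entropy, so a more informative split
```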
We can verify this further by creating a feature importance plot.
fi = DTclassifier.feature_importances_ #feature importance array
fi = pd.Series(data = fi, index = X.columns) #convert to Pandas series for plotting
fi.sort_values(ascending=False, inplace=True) #sort descending
#create bar plot
plt.figure(figsize=(20, 20))
chart = sns.barplot(x=fi, y=fi.index, palette=sns.color_palette("BuGn_r", n_colors=len(fi)))
plt.xticks(rotation=45, ha='right')  # rotate tick labels directly; set_xticklabels(get_xticklabels()) triggers a FixedFormatter warning
plt.show()
This graph shows us the most important features from our model. In future models we should drop the unimportant variables.
from sklearn.ensemble import RandomForestClassifier
RFclassifier = RandomForestClassifier(max_leaf_nodes=30)
RFclassifier.fit(x_train, y_train)
y_pred = RFclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
RFAcc = accuracy_score(y_pred,y_test)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))
precision recall f1-score support
1 0.50 0.25 0.33 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.17 0.25 0.20 8
5 0.08 0.17 0.11 6
6 0.17 0.10 0.12 10
7 0.00 0.00 0.00 2
8 0.33 0.17 0.22 6
9 0.43 0.43 0.43 7
11 0.38 0.75 0.50 4
accuracy 0.20 59
macro avg 0.20 0.21 0.19 59
weighted avg 0.20 0.20 0.19 59
[[1 0 0 0 0 1 1 0 0 1]
[1 0 0 2 1 0 0 0 1 0]
[0 0 0 0 0 1 5 0 0 1]
[0 0 0 2 4 1 0 0 0 1]
[0 0 0 2 1 1 0 1 0 1]
[0 1 0 2 3 1 0 0 3 0]
[0 0 0 1 0 0 0 1 0 0]
[0 0 0 0 4 0 1 1 0 0]
[0 0 0 2 0 1 0 0 3 1]
[0 0 0 1 0 0 0 0 0 3]]
Random Forest accuracy is: 20.34%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K Neighbors', 'SVM', 'Gaussian NB', 'Decision Tree', 'Random Forest'],
'Accuracy': [LRAcc*100, KNAcc*100, SVCAcc*100, NBAcc2*100, DTAcc*100, RFAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)
| | Model | Accuracy |
|---|---|---|
| 2 | SVM | 25.423729 |
| 0 | Logistic Regression | 23.728814 |
| 1 | K Neighbors | 20.338983 |
| 5 | Random Forest | 20.338983 |
| 4 | Decision Tree | 18.644068 |
| 3 | Gaussian NB | 13.559322 |
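For context, these accuracies can be compared against a majority-class baseline built from the test-set supports reported above (class 6 is the largest, with 10 of 59 samples):

```python
# A useful yardstick: always predicting the most common test class.
# Test-set supports taken from the classification reports above.
supports = {1: 4, 2: 5, 3: 7, 4: 8, 5: 6, 6: 10, 7: 2, 8: 6, 9: 7, 11: 4}
baseline = max(supports.values()) / sum(supports.values())
print('Majority-class baseline: {:.2f}%'.format(baseline * 100))  # 16.95%
```

Most models only narrowly beat this ~17% baseline, and Gaussian NB falls below it, so none of the models is predicting Location well.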
sns.set_theme(style="darkgrid")
sns.barplot(data=compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy', palette="mako_r")
plt.ylabel('Accuracy Percentage')
plt.xlabel('Model')
plt.title('Model Accuracy')
plt.show()
fig = px.bar(compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy')
fig.update_layout(
template="seaborn", xaxis={'categoryorder':'total descending'},
title='Model Accuracy')
fig.show()
Let's run the models again, this time dropping the unimportant features.
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 78 columns):
 #   Column                                        Non-Null Count  Dtype
---  ------                                        --------------  -----
 0   Gender_0.0                                    178 non-null    uint8
 1   Gender_1.0                                    178 non-null    uint8
 2   Race_0.0                                      178 non-null    uint8
 3   Race_1.0                                      178 non-null    uint8
 4   Race_2.0                                      178 non-null    uint8
 5   Race_3.0                                      178 non-null    uint8
 6   Race_4.0                                      178 non-null    uint8
 7   Race_5.0                                      178 non-null    uint8
 8   Race_6.0                                      178 non-null    uint8
 9   Race_-1                                       178 non-null    uint8
 10  Suicidality_0.0                               178 non-null    uint8
 11  Suicidality_1.0                               178 non-null    uint8
 12  Suicidality_2.0                               178 non-null    uint8
 13  Voluntary or Involuntary Hospitalization_0.0  178 non-null    uint8
 14  Voluntary or Involuntary Hospitalization_1.0  178 non-null    uint8
 15  Voluntary or Involuntary Hospitalization_2.0  178 non-null    uint8
 16  Prior Hospitalization_0.0                     178 non-null    uint8
 17  Prior Hospitalization_1.0                     178 non-null    uint8
 18  Prior Counseling_0.0                          178 non-null    uint8
 19  Prior Counseling_1.0                          178 non-null    uint8
 20  Voluntary or Mandatory Counseling_0           178 non-null    uint8
 21  Voluntary or Mandatory Counseling_1           178 non-null    uint8
 22  Voluntary or Mandatory Counseling_2           178 non-null    uint8
 23  Voluntary or Mandatory Counseling_3           178 non-null    uint8
 24  Recent or Ongoing Stressor_0                  178 non-null    uint8
 25  Recent or Ongoing Stressor_1                  178 non-null    uint8
 26  Recent or Ongoing Stressor_2                  178 non-null    uint8
 27  Recent or Ongoing Stressor_3                  178 non-null    uint8
 28  Recent or Ongoing Stressor_4                  178 non-null    uint8
 29  Recent or Ongoing Stressor_5                  178 non-null    uint8
 30  Recent or Ongoing Stressor_6                  178 non-null    uint8
 31  Recent or Ongoing Stressor_7                  178 non-null    uint8
 32  Signs of Being in Crisis_0.0                  178 non-null    uint8
 33  Signs of Being in Crisis_1.0                  178 non-null    uint8
 34  Signs of Being in Crisis_3.0                  178 non-null    uint8
 35  Timeline of Signs of Crisis_0.0               178 non-null    uint8
 36  Timeline of Signs of Crisis_1.0               178 non-null    uint8
 37  Timeline of Signs of Crisis_2.0               178 non-null    uint8
 38  Timeline of Signs of Crisis_3.0               178 non-null    uint8
 39  Timeline of Signs of Crisis_-1                178 non-null    uint8
 40  Leakage _0.0                                  178 non-null    uint8
 41  Leakage _1.0                                  178 non-null    uint8
 42  Leakage How_0                                 178 non-null    uint8
 43  Leakage How_1                                 178 non-null    uint8
 44  Leakage How_2                                 178 non-null    uint8
 45  Leakage How_3                                 178 non-null    uint8
 46  Leakage How_4                                 178 non-null    uint8
 47  Leakage How_5                                 178 non-null    uint8
 48  Leakage How_6                                 178 non-null    uint8
 49  Leakage How_-1                                178 non-null    uint8
 50  Leakage Who _10                               178 non-null    uint8
 51  Leakage Who _-1                               178 non-null    uint8
 52  Leakage Who _0                                178 non-null    uint8
 53  Leakage Who _1                                178 non-null    uint8
 54  Leakage Who _2                                178 non-null    uint8
 55  Leakage Who _3                                178 non-null    uint8
 56  Leakage Who _4                                178 non-null    uint8
 57  Leakage Who _5                                178 non-null    uint8
 58  Leakage Who _6                                178 non-null    uint8
 59  Leakage Who _7                                178 non-null    uint8
 60  Leakage Who _8                                178 non-null    uint8
 61  Leakage Who _9                                178 non-null    uint8
 62  Age_binned_<20s                               178 non-null    uint8
 63  Age_binned_20s                                178 non-null    uint8
 64  Age_binned_30s                                178 non-null    uint8
 65  Age_binned_40s                                178 non-null    uint8
 66  Age_binned_50s                                178 non-null    uint8
 67  Age_binned_60s                                178 non-null    uint8
 68  Age_binned_>60s                               178 non-null    uint8
 69  Number_Killed_binned_Low                      178 non-null    uint8
 70  Number_Killed_binned_Medium                   178 non-null    uint8
 71  Number_Killed_binned_High                     178 non-null    uint8
 72  Number_Injured_binned_Low                     178 non-null    uint8
 73  Number_Injured_binned_Medium                  178 non-null    uint8
 74  Number_Injured_binned_High                    178 non-null    uint8
 75  Casualties_binned_Low                         178 non-null    uint8
 76  Casualties_binned_Medium                      178 non-null    uint8
 77  Casualties_binned_High                        178 non-null    uint8
dtypes: uint8(78)
memory usage: 14.9 KB
X = X[['Age_binned_<20s', 'Voluntary or Mandatory Counseling_1', 'Casualties_binned_Low', 'Leakage How_-1',
'Suicidality_1.0', 'Number_Killed_binned_High', 'Age_binned_20s', 'Suicidality_2.0', 'Recent or Ongoing Stressor_7',
'Voluntary or Involuntary Hospitalization_2.0', 'Number_Injured_binned_Low', 'Race_-1', 'Timeline of Signs of Crisis_3.0',
'Gender_1.0', 'Recent or Ongoing Stressor_3', 'Race_1.0', 'Voluntary or Involuntary Hospitalization_0.0']]
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 17 columns):
 #   Column                                        Non-Null Count  Dtype
---  ------                                        --------------  -----
 0   Age_binned_<20s                               178 non-null    uint8
 1   Voluntary or Mandatory Counseling_1           178 non-null    uint8
 2   Casualties_binned_Low                         178 non-null    uint8
 3   Leakage How_-1                                178 non-null    uint8
 4   Suicidality_1.0                               178 non-null    uint8
 5   Number_Killed_binned_High                     178 non-null    uint8
 6   Age_binned_20s                                178 non-null    uint8
 7   Suicidality_2.0                               178 non-null    uint8
 8   Recent or Ongoing Stressor_7                  178 non-null    uint8
 9   Voluntary or Involuntary Hospitalization_2.0  178 non-null    uint8
 10  Number_Injured_binned_Low                     178 non-null    uint8
 11  Race_-1                                       178 non-null    uint8
 12  Timeline of Signs of Crisis_3.0               178 non-null    uint8
 13  Gender_1.0                                    178 non-null    uint8
 14  Recent or Ongoing Stressor_3                  178 non-null    uint8
 15  Race_1.0                                      178 non-null    uint8
 16  Voluntary or Involuntary Hospitalization_0.0  178 non-null    uint8
dtypes: uint8(17)
memory usage: 4.3 KB
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
print('The Shape Of The Original Data: ', model_df.shape)
print('The Shape Of x_test: ', x_test.shape)
print('The Shape Of x_train: ', x_train.shape)
print('The Shape Of y_test: ', y_test.shape)
print('The Shape Of y_train: ', y_train.shape)
The Shape Of The Original Data:  (178, 18)
The Shape Of x_test:  (59, 17)
The Shape Of x_train:  (119, 17)
The Shape Of y_test:  (59,)
The Shape Of y_train:  (119,)
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, data=model_df, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Unbalanced Data')
plt.show()
x_train, y_train = SMOTE(k_neighbors=1).fit_resample(x_train, y_train)
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, data=model_df, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Balanced Data')
plt.show()
LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)
y_pred = LRclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
from sklearn.metrics import accuracy_score
LRAcc = accuracy_score(y_pred,y_test)
print('Logistic Regression accuracy is: {:.2f}%'.format(LRAcc*100))
precision recall f1-score support
1 0.00 0.00 0.00 4
2 0.20 0.20 0.20 5
3 0.00 0.00 0.00 7
4 0.17 0.12 0.14 8
5 0.22 0.33 0.27 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.20 0.57 0.30 7
11 0.30 0.75 0.43 4
accuracy 0.19 59
macro avg 0.11 0.20 0.13 59
weighted avg 0.11 0.19 0.13 59
[[0 0 0 1 0 0 0 0 1 2]
[0 1 0 0 1 0 1 0 1 1]
[0 1 0 0 0 0 1 0 3 2]
[1 0 0 1 3 0 0 0 3 0]
[0 1 0 2 2 0 0 0 1 0]
[0 1 0 1 1 0 2 0 5 0]
[0 0 0 1 0 0 0 0 1 0]
[0 1 0 0 2 1 1 0 1 0]
[0 0 0 0 0 0 1 0 4 2]
[0 0 0 0 0 0 1 0 0 3]]
Logistic Regression accuracy is: 18.64%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(x_train, y_train)
y_pred = KNclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
KNAcc = accuracy_score(y_pred,y_test)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc*100))
precision recall f1-score support
1 0.00 0.00 0.00 4
2 0.11 0.20 0.14 5
3 0.00 0.00 0.00 7
4 0.00 0.00 0.00 8
5 0.11 0.33 0.16 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.11 0.14 0.12 7
11 0.40 0.50 0.44 4
accuracy 0.10 59
macro avg 0.07 0.12 0.09 59
weighted avg 0.06 0.10 0.07 59
[[0 0 0 0 1 1 0 0 1 1]
[0 1 0 2 1 0 0 0 1 0]
[0 2 0 1 1 1 1 0 1 0]
[0 0 0 0 3 1 1 1 2 0]
[0 3 0 0 2 0 0 0 1 0]
[0 2 0 0 5 0 1 0 2 0]
[0 0 0 1 0 1 0 0 0 0]
[0 0 0 0 5 0 1 0 0 0]
[0 0 0 0 1 0 3 0 1 2]
[0 1 0 0 0 0 0 1 0 2]]
K Neighbours accuracy is: 10.17%
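The choice of `n_neighbors=20` above is arbitrary; cross-validation can select it instead. A hedged sketch using synthetic data (`make_classification` stands in for the project's `x_train`/`y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x_train / y_train.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, n_classes=3,
                           random_state=0)

# Search a small grid of k values with 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 10, 20]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` can then be evaluated on the held-out test split in place of the hand-picked classifier.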
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (repeated 3x)
SVCclassifier = SVC(kernel='linear', max_iter=251)
SVCclassifier.fit(x_train, y_train)
y_pred = SVCclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
SVCAcc = accuracy_score(y_test, y_pred)
print('SVC accuracy is: {:.2f}%'.format(SVCAcc*100))
precision recall f1-score support
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.17 0.38 0.23 8
5 0.50 0.17 0.25 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.15 0.57 0.24 7
11 0.50 0.75 0.60 4
accuracy 0.19 59
macro avg 0.13 0.19 0.13 59
weighted avg 0.12 0.19 0.13 59
[[0 0 0 1 0 0 0 0 2 1]
[0 0 0 3 0 0 0 0 2 0]
[0 0 0 1 0 0 1 0 4 1]
[0 0 0 3 0 1 1 0 3 0]
[0 0 0 3 1 0 1 0 1 0]
[0 1 0 2 0 0 0 0 7 0]
[0 0 0 2 0 0 0 0 0 0]
[0 0 0 1 1 0 1 0 3 0]
[0 0 0 2 0 0 0 0 4 1]
[0 0 0 0 0 0 0 0 1 3]]
SVC accuracy is: 18.64%
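The `ConvergenceWarning` below (`max_iter=251`) follows the advice in its own message: scale the features first. A sketch of a `StandardScaler` pipeline on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the project's features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling the features first usually lets the linear SVC converge
# without needing to cap or raise max_iter.
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X, y)
acc = model.score(X, y)
```

The same pipeline object can be dropped into the fit/predict/report pattern used throughout this notebook.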
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py:301: ConvergenceWarning: Solver terminated early (max_iter=251). Consider pre-processing your data with StandardScaler or MinMaxScaler. /Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (repeated 3x)
NBclassifier2 = GaussianNB()
NBclassifier2.fit(x_train, y_train)
y_pred = NBclassifier2.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
NBAcc2 = accuracy_score(y_test, y_pred)
print('Gaussian Naive Bayes accuracy is: {:.2f}%'.format(NBAcc2*100))
precision recall f1-score support
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.11 0.38 0.17 8
5 0.00 0.00 0.00 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.21 0.71 0.32 7
10 0.00 0.00 0.00 0
11 0.00 0.00 0.00 4
accuracy 0.14 59
macro avg 0.03 0.10 0.04 59
weighted avg 0.04 0.14 0.06 59
[[0 0 0 3 0 0 0 0 1 0 0]
[0 0 0 2 0 0 1 0 2 0 0]
[0 0 0 3 0 0 1 0 2 1 0]
[0 0 0 3 0 0 0 0 4 1 0]
[0 0 0 5 0 0 0 0 1 0 0]
[0 0 0 2 0 0 0 0 6 2 0]
[0 0 0 2 0 0 0 0 0 0 0]
[0 0 0 2 0 0 1 0 3 0 0]
[0 0 0 2 0 0 0 0 5 0 0]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 3 0 0 1 0 0 0 0]]
Gaussian Naive Bayes accuracy is: 13.56%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. /Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior. (both repeated 3x)
DTclassifier = DecisionTreeClassifier(max_leaf_nodes=20)
DTclassifier = DTclassifier.fit(x_train, y_train)
y_pred = DTclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
DTAcc = accuracy_score(y_test, y_pred)
print('Decision Tree accuracy is: {:.2f}%'.format(DTAcc*100))
precision recall f1-score support
1 0.22 0.50 0.31 4
2 0.00 0.00 0.00 5
3 0.33 0.14 0.20 7
4 0.00 0.00 0.00 8
5 0.00 0.00 0.00 6
6 0.20 0.10 0.13 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.00 0.00 0.00 7
10 0.00 0.00 0.00 0
11 0.60 0.75 0.67 4
accuracy 0.12 59
macro avg 0.12 0.14 0.12 59
weighted avg 0.13 0.12 0.11 59
[[2 0 0 0 1 0 0 0 1 0 0]
[1 0 0 0 1 1 1 0 1 0 0]
[0 0 1 0 1 1 0 0 3 1 0]
[0 0 1 0 1 1 0 0 4 0 1]
[4 0 0 0 0 1 1 0 0 0 0]
[1 1 0 0 1 1 0 0 4 2 0]
[1 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 3 0 0 0 3 0 0]
[0 0 0 0 1 0 3 0 0 2 1]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 3]]
Decision Tree accuracy is: 11.86%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. /Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior. (both repeated 3x)
fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (20,20), dpi=600)
tree.plot_tree(DTclassifier, max_depth = 20, feature_names = X.columns, filled=True)
plt.show()
fi = DTclassifier.feature_importances_ #feature importance array
fi = pd.Series(data = fi, index = X.columns) #convert to Pandas series for plotting
fi.sort_values(ascending=False, inplace=True) #sort descending
#create bar plot
plt.figure(figsize=(25, 20))
chart = sns.barplot(x=fi, y=fi.index, palette=sns.color_palette("mako_r", n_colors=len(fi)))
chart.tick_params(axis='x', rotation=45) #avoids the FixedFormatter warning from set_xticklabels(get_xticklabels())
plt.show()
RFclassifier = RandomForestClassifier(max_leaf_nodes=30)
RFclassifier.fit(x_train, y_train)
y_pred = RFclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
RFAcc = accuracy_score(y_test, y_pred)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))
precision recall f1-score support
1 0.14 0.25 0.18 4
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 7
4 0.25 0.12 0.17 8
5 0.11 0.17 0.13 6
6 0.00 0.00 0.00 10
7 0.00 0.00 0.00 2
8 0.00 0.00 0.00 6
9 0.07 0.14 0.10 7
11 0.38 0.75 0.50 4
accuracy 0.12 59
macro avg 0.10 0.14 0.11 59
weighted avg 0.09 0.12 0.09 59
[[1 0 1 0 0 0 0 0 1 1]
[2 0 1 0 0 0 0 0 2 0]
[0 0 0 2 0 0 1 0 3 1]
[0 0 0 1 2 0 0 2 2 1]
[3 0 0 0 1 0 0 1 1 0]
[1 1 0 0 2 0 2 0 4 0]
[0 0 0 1 0 1 0 0 0 0]
[0 0 0 0 4 0 2 0 0 0]
[0 0 0 0 0 0 4 0 1 2]
[0 0 0 0 0 0 0 1 0 3]]
Random Forest accuracy is: 11.86%
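With only 59 test rows, the random forest's out-of-bag (OOB) estimate offers a free second opinion on generalization: each tree is scored on the bootstrap rows it never saw. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the project's features and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# oob_score=True scores each tree on the bootstrap rows it did not
# train on, giving a validation estimate without a held-out split.
rf = RandomForestClassifier(max_leaf_nodes=30, oob_score=True,
                            n_estimators=200, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)
```

Comparing `rf.oob_score_` against the test-set accuracy helps flag when a small test split is giving a noisy estimate.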
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K Neighbors', 'SVM', 'Gaussian NB', 'Decision Tree', 'Random Forest'],
'Accuracy': [LRAcc*100, KNAcc*100, SVCAcc*100, NBAcc2*100, DTAcc*100, RFAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)
| | Model | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 18.644068 |
| 2 | SVM | 18.644068 |
| 3 | Gaussian NB | 13.559322 |
| 4 | Decision Tree | 11.864407 |
| 5 | Random Forest | 11.864407 |
| 1 | K Neighbors | 10.169492 |
sns.set_theme(style="darkgrid")
sns.barplot(data=compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy', palette="mako_r")
plt.ylabel('Accuracy Percentage')
plt.xlabel('Model')
plt.title('Important Feature Model Accuracy')
plt.show()
The models are markedly less accurate after dropping the less important features: the best performers, logistic regression and SVM, reach only 18.64%, and the rest fall between roughly 10% and 14%.
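With around ten location classes in the test set, a majority-class baseline puts these accuracies in context. A sketch using synthetic stand-in labels (not the project's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in labels with 10 classes.
rng = np.random.default_rng(0)
y_train = rng.integers(0, 10, size=200)
y_test = rng.integers(0, 10, size=59)
X_train = np.zeros((200, 1))  # features are ignored by the dummy
X_test = np.zeros((59, 1))

# Always predicts the most frequent training class.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))
```

If a model cannot clearly beat this baseline, its accuracy advantage over always guessing the most common location is negligible.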